This project analyzes the correlation between diabetes and several social determinants: race, income, insurance, and distance to clinics. The goal is to understand how these factors influence the prevalence and management of diabetes, providing insights that can inform public health strategies, healthcare policies, and intervention programs.
I spoke to Dr. Richard Tsui about my project, and he guided me to choose specific social determinants that correlate directly with the disease I want to learn more about.
2 Introduction
According to the CDC, in 2021, 38.4 million people of all ages in the United States had diabetes, and diabetes was the eighth leading cause of death in the United States. An article titled “Overview of Social Determinants of Health in the Development of Diabetes” from the Diabetes Journals states that diabetes has long-standing, well-documented socioeconomic and racial/ethnic inequalities in prevalence, incidence, morbidity, and mortality. Higher diabetes prevalence is associated with lower education, lower income, and non-White race/ethnicity.
The World Health Organization (WHO) Commission defined Social Determinants of Health (SDOH) as “the conditions in which people are born, grow, live, work and age, and the wider set of forces and systems shaping the conditions of daily life”. SDOH account for between 30% and 55% of health outcomes and are viewed as the main driver of avoidable health inequities. Because of the association between social determinants of health and diabetes, I would like to analyze the following factors: race, income, insurance, and distance to clinics.
3 Methods
The datasets used in this analysis are:
1. Diabetes: the 2022 Behavioral Risk Factor Surveillance System (BRFSS) dataset.
2. Social determinants: the Agency for Healthcare Research and Quality’s Social Determinants of Health Database. I used the 2020 census tract data.
# Load the necessary packages
library(readxl)
library(dplyr)
library(ggplot2)
library(sf)
library(tigris)
library(leaflet)
library(maps)
Rows: 235 Columns: 28
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (21): LocationAbbr, LocationDesc, Class, Topic, Indicator, Response, Dat...
dbl (7): ID, Year, Low_Confidence_Limit, High_Confidence_Limit, Sample_Size...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Keep only the rows for Pennsylvania and West Virginia
PA_WV_diabetes <- diabetes %>%
  filter(LocationDesc %in% c("Pennsylvania", "West Virginia"))

# Convert Data_Value to numeric so it can be graphed and compared later
diabetes$Data_Value <- as.numeric(diabetes$Data_Value)
Warning: NAs introduced by coercion
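The coercion warning means that at least one `Data_Value` entry could not be parsed as a number and was replaced with `NA`. As a minimal sketch, using hypothetical toy values rather than the BRFSS file, the offending entries could be listed like this before deciding how to handle them:

```r
# Toy stand-in for the raw Data_Value column (hypothetical values, not BRFSS data)
raw_values <- c("11.5", "17.4", "N/A", "suppressed")

# Convert, silencing the warning we are about to investigate
numeric_values <- suppressWarnings(as.numeric(raw_values))

# Entries that were present as text but failed to parse as numbers
failed <- raw_values[is.na(numeric_values) & !is.na(raw_values)]
print(failed)
```

Applied to the real data, the same pattern would use `diabetes$Data_Value` in place of `raw_values`, run before the column is overwritten.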
# Find the state with the highest diabetes diagnosis percentage, excluding U.S. island territories
most_diabetes <- diabetes %>%
  filter(
    Response == "Yes",
    !grepl("median", LocationDesc, ignore.case = TRUE),
    !LocationDesc %in% c("Guam", "Puerto Rico", "Virgin Islands")
  ) %>%
  select(LocationDesc, Response, Data_Value) %>%
  arrange(desc(Data_Value)) %>%
  head(1)

# Print the result
summary(most_diabetes)
 LocationDesc         Response           Data_Value
 Length:1           Length:1           Min.   :17.4
 Class :character   Class :character   1st Qu.:17.4
 Mode  :character   Mode  :character   Median :17.4
                                       Mean   :17.4
                                       3rd Qu.:17.4
                                       Max.   :17.4
From the output above, the highest percentage of the population diagnosed with diabetes in any state is 17.4%, which is the value for West Virginia.
The table below shows the percentage of the population with diabetes in Pennsylvania and West Virginia.
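Because `summary()` reports only the length and type of a character column, the state name itself does not appear in the output above. A small sketch, using the two percentages reported for Pennsylvania and West Virginia in this report as toy data, of pulling the state name directly with `which.max()`:

```r
# Toy data: the two "Yes" percentages reported in this document
rates <- data.frame(
  LocationDesc = c("Pennsylvania", "West Virginia"),
  Data_Value   = c(11.5, 17.4)
)

# Select the row with the largest percentage; printing it shows the state name and the value
top_state <- rates[which.max(rates$Data_Value), ]
print(top_state)
```

In the analysis itself, printing `most_diabetes` directly instead of calling `summary()` on it would serve the same purpose.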
# A tibble: 2 × 3
LocationDesc Response Data_Value
<chr> <chr> <chr>
1 Pennsylvania Yes 11.5
2 West Virginia Yes 17.4
The following code maps the percentage of the population with diabetes in each state of the United States.
# Keep only the "Yes" responses, excluding the median row and U.S. island territories
diabetes_yes <- diabetes %>%
  filter(
    Response == "Yes",
    !grepl("median", LocationDesc, ignore.case = TRUE),
    !LocationDesc %in% c("Guam", "Puerto Rico", "Virgin Islands")
  )

# Download county boundaries for every U.S. state
counties1 <- counties(cb = TRUE, class = "sf")

# Join the state-level diabetes data to the county shapes by state abbreviation
diabetes_map <- inner_join(diabetes_yes, counties1, by = c("LocationAbbr" = "STUSPS"))

# Convert the joined table to an sf object for mapping
diabetes_map <- st_as_sf(diabetes_map)

# Create a color palette based on the diabetes percentage
pal <- colorNumeric(palette = "YlOrRd", domain = diabetes_map$Data_Value)

# Build the popup text (the value shown is the diabetes percentage)
map_data <- diabetes_map %>%
  mutate(popup_info = paste0(
    "<b>State:</b> ", LocationAbbr, "<br>",
    "<b>Diabetes (%):</b> ", Data_Value, "<br>"
  ))

# Generate the map
diabetes_by_state <- leaflet(data = map_data) %>%
  addTiles() %>%
  addPolygons(
    fillColor = ~pal(Data_Value),
    fillOpacity = 0.5,
    color = "black",
    weight = 1,
    popup = ~popup_info
  ) %>%
  fitBounds(
    lng1 = min(st_bbox(diabetes_map)$xmin), # Minimum longitude
    lat1 = min(st_bbox(diabetes_map)$ymin), # Minimum latitude
    lng2 = max(st_bbox(diabetes_map)$xmax), # Maximum longitude
    lat2 = max(st_bbox(diabetes_map)$ymax)  # Maximum latitude
  )
Warning: sf layer has inconsistent datum (+proj=longlat +datum=NAD83 +no_defs).
Need '+proj=longlat +datum=WGS84'
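The warning appears because leaflet expects WGS84 longitude/latitude, while the shapes downloaded through tigris use the NAD83 datum. One possible way to address it, sketched here on a single hypothetical point rather than the county polygons, is to reproject with `sf::st_transform()` before passing the data to `leaflet()`:

```r
library(sf)

# A hypothetical point in NAD83 (EPSG:4269), standing in for the county layer
pt_nad83 <- st_sfc(st_point(c(-80.0, 40.4)), crs = 4269)

# Reproject to WGS84 (EPSG:4326), the datum leaflet expects
pt_wgs84 <- st_transform(pt_nad83, 4326)

# Confirm the new coordinate reference system
print(st_crs(pt_wgs84)$epsg)
```

In this report's pipeline, the equivalent step would be `diabetes_map <- st_transform(diabetes_map, 4326)`, placed after `st_as_sf()` and before the palette and map are built.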